This portfolio looks at the connections between academic institutions on GitHub. The data was extracted from Google’s public data warehouse “github-repos” and pre-processed in Python (whose data structures made the pre-processing more efficient than it would have been in R). The Python code for pre-processing is attached at the end of this report.
The first graph, “Connections Between Institutions”, addresses the question of how academic institutions collaborate with each other. It uses a node-link diagram to picture the broad GitHub collaboration landscape: each node is an academic institution, and there is an edge between two institutions if they have contributed to the same repository. Three design choices encode additional information: 1) nodes are filled with colors that correlate with the number of contributors: the brighter an institution, the more contributors it has; 2) edges are drawn in colors that correlate with the number of shared contributors between two institutions: the brighter an edge, the more shared contributors there are; 3) the graph uses the Kamada-Kawai (“kk”) layout, which tends to place institutions that connect to many others near the center of the graph and those with fewer connections (the less socially active on GitHub) at the periphery.
With these design choices, the audience can spot the more active collaborators by looking at the center of the graph, tell which institutions have more committers by looking at their color, and find all collaborating institutions for a node of interest by following the edges extending from it.
For example, universities like Wisc, MIT, Middlebury, UW, and UChicago are among the most collaborative players in the open-source world. On the other hand, universities like Toronto, Texas State (txstate), and Virginia are a bit socially awkward on GitHub.
UW-Madison (labeled in red) sits almost exactly at the center of the graph, meaning it is one of the most proactive collaborators in the GitHub open-source community. In addition, Wisc has a darker fill, meaning it has relatively few contributors. Together, these suggest that contributors at UW-Madison are, on average, considerably more collaborative than those at other institutions.
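The per-contributor reading above can be made concrete with a small calculation. The following is a minimal sketch in base R on made-up numbers (the institutions `a` to `d`, the edge list, and the contributor counts are hypothetical, not taken from the data). It counts the edges touching each institution and divides by the institution's number of contributors.

```r
# Hypothetical toy data: undirected institution pairs and contributor counts.
toy_edges <- data.frame(
  source = c("a", "a", "a", "b"),
  target = c("b", "c", "d", "c")
)
toy_vertices <- data.frame(
  name = c("a", "b", "c", "d"),
  num_committers = c(10, 40, 20, 5)
)

# Degree: number of edges touching each institution (at either endpoint).
endpoints <- c(toy_edges$source, toy_edges$target)
toy_vertices$degree <- as.integer(
  table(factor(endpoints, levels = toy_vertices$name))
)

# Connections per contributor: high values mean few people but many partners.
toy_vertices$degree_per_committer <-
  toy_vertices$degree / toy_vertices$num_committers

# Rank institutions from most to least collaborative per contributor.
toy_vertices[order(-toy_vertices$degree_per_committer), ]
```

In this toy example, `a` ranks first because it has the most partners despite a modest number of contributors, which is the pattern the text attributes to Wisc.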
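library(tidyverse)  # readr, dplyr, tidyr, ggplot2
library(tidygraph)  # tbl_graph(), activate()
library(ggraph)     # ggraph(), geom_edge_link(), geom_node_label()
network <- read_csv("data/network.csv")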
## Rows: 139970 Columns: 8
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (4): src_committer, src_institution, dst_committer, dst_institution
## dbl (4): src_num_commits, src_total_commits, dst_num_commits, dst_total_commits
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# prepare data for the connections across institutions:
# count committer-pair rows for each pair of distinct institutions
institution_network <- network %>%
  group_by(src_institution, dst_institution) %>%
  summarize(shared_committers = n(), .groups = "drop") %>%
  filter(src_institution != dst_institution)
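To illustrate what this aggregation step produces, here is a minimal sketch in base R on made-up rows (the toy institutions `a`, `b`, and `c` are hypothetical). Each row stands for one pair of committers who contributed to the same repository, so the count per institution pair is the number of such pair rows.

```r
# Hypothetical committer-pair rows: one row per pair of committers who
# contributed to the same repository.
toy_pairs <- data.frame(
  src_institution = c("a", "a", "a", "b", "a"),
  dst_institution = c("b", "b", "c", "c", "a")
)

# Drop within-institution pairs (here the single a-a row).
cross <- toy_pairs[toy_pairs$src_institution != toy_pairs$dst_institution, ]

# Count the remaining rows per (source, destination) institution pair.
cross$shared_committers <- 1L
toy_institution_network <- aggregate(
  shared_committers ~ src_institution + dst_institution,
  data = cross,
  FUN = sum
)
toy_institution_network
```

The result has one row per institution pair, which is the shape the edge list of the graph needs.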
# load commits-by-institution data and count committers per institution;
# these rows become the vertices of the graph
vertices <- read_csv("data/479_tidy_data.csv") %>%
  group_by(institution) %>%
  summarize(num_committers = n()) %>%
  drop_na() %>%
  rename(name = institution)
## New names:
## * `` -> ...1
## Rows: 108153 Columns: 5
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (3): name, email, institution
## dbl (2): ...1, num_commit
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
edges <- data.frame(
  source = institution_network$src_institution,
  target = institution_network$dst_institution,
  edge_committers = institution_network$shared_committers
)
# only preserve edges whose source institution has at least 4 connections
edges <- edges %>%
  group_by(source) %>%
  mutate(num_connections = n()) %>%
  ungroup() %>%
  filter(num_connections > 3)
# only preserve vertices that are both in source and target
vertices <- vertices %>%
  filter(name %in% edges$source & name %in% edges$target)
# only preserve edges that are in vertices
edges <- edges %>%
  filter(source %in% vertices$name & target %in% vertices$name)
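G <- tbl_graph(nodes = vertices, edges = edges, directed = FALSE)
# scale the counts to (0, 1] so that fill and edge color track them
G <- G %>%
  mutate(node_weight = num_committers / max(num_committers)) %>%
  activate(edges) %>%
  mutate(edge_weight = edge_committers / max(edge_committers))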
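For the colors to track the counts, the weights need to be a deterministic function of those counts. The following is a minimal sketch in base R using a hypothetical helper, `rescale01`, which is not part of the original code, applied to made-up counts. The log variant is an assumption that may suit heavily skewed counts, where a few large institutions would otherwise push everyone else to the dark end of the scale.

```r
# Hypothetical helper: min-max scale a numeric vector into [0, 1].
rescale01 <- function(x) {
  rng <- range(x, na.rm = TRUE)
  # Constant input has no spread, so map everything to the midpoint.
  if (rng[1] == rng[2]) return(rep(0.5, length(x)))
  (x - rng[1]) / (rng[2] - rng[1])
}

counts <- c(1, 10, 100, 1000)            # made-up committer counts
linear_weight <- rescale01(counts)       # dominated by the largest count
log_weight <- rescale01(log10(counts))   # spreads the small counts apart
```

With these toy counts, the linear weights bunch the three smallest values near zero, while the log weights are evenly spaced between 0 and 1.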
# visualize the tbl_graph object
ggraph(G, layout = 'kk') +
  geom_edge_link(aes(col = edge_weight, width = edge_weight), alpha = 0.5) +
  scale_edge_width_continuous(range = c(0, 0.5)) +
  # label wisc in red, all other institutions in white
  geom_node_label(aes(label = name, fill = node_weight, color = ifelse(name == "wisc", "#ff0000", "#ffffff"))) +
  coord_fixed() +
  labs(title = "Connections Between Institutions") +
  theme_void() +
  theme(plot.title = element_text(size = 20, hjust = 0.5)) +
  scale_fill_continuous(type = "viridis") +
  scale_edge_colour_viridis() +
  guides(edge_width = "none") +
  scale_colour_identity()